Whole-genome risk prediction of common diseases in human preimplantation embryos

Preimplantation genetic testing (PGT) of in-vitro-fertilized embryos has been proposed as a method to reduce transmission of common disease; however, more comprehensive embryo genetic assessment, combining the effects of common variants and rare variants, remains unavailable. Here, we used a combination of molecular and statistical techniques to reliably infer inherited genome sequence in 110 embryos and model susceptibility across 12 common conditions. We observed a genotype accuracy of 99.0–99.4% at sites relevant to polygenic risk scoring in cases from day-5 embryo biopsies and 97.2–99.1% in cases from day-3 embryo biopsies. Combining rare variants with polygenic risk score (PRS) magnifies predicted differences across sibling embryos. For example, in a couple with a pathogenic BRCA1 variant, we predicted a 15-fold difference in odds ratio (OR) across siblings when combining versus a 4.5-fold or 3-fold difference with BRCA1 or PRS alone. Our findings may inform the discussion of utility and implementation of genome-based PGT in clinical practice.


Background
Whole genome reconstruction as discussed in the manuscript involves accurate determination of a subset of genotypes derived from microarray measurements from single or few cell biopsies from embryos preimplantation ( Figure N1A; "PS Embryo Genotypes"; Table N1). "Parental Support", also referred to as the phasing algorithm in this supplement, is a method of combining SNP array measurements from multiple embryos and the parents along with recombination frequencies from the HapMap database to enable accurate prediction of chromosome copy numbers, insertions and deletions, embryo genotypes, parent haplotypes as well as embryo parent haplotype origin hypotheses. We describe the process in US Patent 8515679_B2. A summary of the method with certain updates follows. Note that phasing and reconstruction is completed only for euploid chromosomes. Figure N1A. Whole genome reconstruction involves two sources of data. 1) Whole genome sequencing of prospective parents and 2) SNP microarray genotyping of sibling embryos. Limited DNA in embryo biopsy requires amplification and results in inaccuracies in sequence. Figure N1B. Allele measurements at each SNP are color-coded based on the parental haplotype of origin in this example. The Parental Support algorithm uses an HMM that accounts for measurements on sibling genotypes as well as parental genotypes to improve accuracy across several hundred thousand positions. The output of this approach are PS Parental Haplotypes and PS Embryo Genotypes at Illumina CytoSNP12b array locations.

Parental Microarray Data
Mother (MD) and father (FD) microarray data, with genotypes marked as AA,AB,BB (A=Ref,B=Mut).

Embryo Microarray Data
Up to k Embryo microarray measurements (ED1,...,EDk), stored as intensity values on the "A" channel and "B" channel.

PS Parental Haplotypes
A best estimate of each phased parental genotype, denoted as (genotype at haplotype1,genotype at haplotype 2). E.g. phased genotype AB signifies that A comes from the first haplotype and B from the second. For each position on the microarray, the possible phased parental haplotypes are {AA, AB, BA, BB}.

Embryo Parental Origin Hypothesis
The particular haplotypes inherited from the specific parent. Denoted as H1 or H2, depending on which haplotype the embryo inherited from a parent. E.g. MH1,FH2 means that the embryo has inherited its mother's haplotype 1 and father's haplotype 2.

PS Embryo Genotype
A direct combination of mother, father haplotypes and parental origin hypothesis. Calculated only on euploid segments of the genome. E.g. if parent haplotypes are (BA,AB) and parent origin hypotheses are (H1,H2), the embryo inherits a B from mother (H1 from BA) and a B from father (H2 from AB), so the PS embryo genotype is BB.

Simplified example
To conceptually illustrate how noisy genetic measurements from sibling embryos and parents can be leveraged to increase accuracy, we start with a simplified approach on one chromosome, assuming a) 4 embryos b) accurate parental genotypes c) no meiotic recombinations in the parents and d) noise in the embryo genotypes and only one chromosome inherited from each parent. .
We introduce the concept of "Parent Context", which is the combination of mother and father genotypes at a particular site/SNP, denoted as (mother genotype|father genotype). Specific contexts will inform the PS algorithm differently. Table N2 presents the possible scenarios of SNP parent contexts and the way in which they inform the PS algorithm.  Table N3. Sample Data and PS output. Red letters indicate scenarios where PS Embryo Genotypes differ from the raw embryo genotypes.
STEP 1 Set homozygous parents and unambiguous parent contexts on SNPs 1,2, so correcting mendelian errors in e3 on SNP 1 and e4 on SNP 2. By symmetry, we also set the phase (PS Parental Haplotype) of the first heterozygous mother (SNP 3) and father (SNP 4) to AB.
STEP 2 Compute most likely embryo hypotheses for all SNPs with currently phased parents --SNPs 3,4 here. We present possible parent origin hypothesis scenarios, and choose the most likely scenario for each embryo, as seen in Table N4A.   Table N4B.
Step 3 scenarios In this simplified case both maternal scenarios are equally likely on SNP 8. In the case of embryo data given by microarray measurements, we will get one scenario more likely than the other with confidence adjusted accordingly.

STEP 4
Phase parents at sites where both parents are ambiguous, using embryo genotypes and putative parent origin hypotheses for E1 through E4 (determined in Step2). This results in the most likely estimate of PS Parental Haplotypes for SNPs 5 and 7. PS finds the best matches for both SNPs and corrects E1 on SNP 7, as seen in Table N4C.

Hidden Markov Model and related parameters
The full implementation of Parental Support supporting meiotic crossovers involves a Hidden Markov Model (HMM) with a forward-backward (FBA) algorithm implemented.
Background: A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process Xt, through "time" t, with unobservable, i.e. hidden states {x}. The approach assumes that there is another process Yt, with observable states {y}, whose behavior through time depends on X ( Figure N2A and N2B). The goal is to learn about X by observing Y. In an HMM we assume that for each time instance t, the conditional distribution of Yt depends only on Xt, via probability P(y|x)=P(Yt=y|Xt=x). This probability is the emission probability.
The probability of the observable sequence Y=(Y1,...,Yn) can be written by Bayes rule as P(Y)=∑XP(Y|X)P(X). For an HMM as in Figure N1A, we are interested in the posterior probability P (Xt=x|(y1,..,yn)), i.e. a probability of an unobservable state x at time t, given observed states (y1,..,yn). The forward algorithm calculates the joint probability of a hidden state x and (y1,..,yt) A(x,t)=P(Xt=x,y1,..,yt) as A(x,t)=P(Yt=yt|Xt=x)*∑zP(x|z,t)*A(z,t-1) thus reducing the problem of order t to the problem of order t-1, as seen in Figure N1B. P(x|z,t) is referred to as a hidden state transition probability at time t. Posterior probability of any hidden state x at time n is then P(x|(y1,..,yn))~A(x,n).

Specific Implementation for Embryo Measurements (Parental Support):
This statistical model incorporates the fact that an embryo will inherit alleles from the same parent homolog on consecutive SNPs, unless a meiotic recombination (with probability estimated in the HapMap database) has occurred between the two SNPs. The joint distribution on genotype probabilities thus combines the array data, the individual embryo genotypes suggested by the array data, and the parent haplotyping that could produce those distributions of genotypes among various embryos. Consecutive SNPs represent "time" t, with additional definitions below as shown in Figure N3 and Table N5. The approach is applied to each full chromosome separately, at all sites on the array. The number of SNPs per chromosome ranges from 4.3K (chrom 21) to 23.7K (chrom 2). In an advance over Kumar et al. 2015 the approach is run across the entire chromosome instead of smaller regions of the genome. This modified approach allows for crossovers within, as well as between bins, as well as inference of problematic genome sections.

Transition probability
The product of haplotype prior frequencies (assuming no parent SNP linkage) and origin hypotheses transition probabilities. Specifically, for states x=(MG,FG,MHj,FHj,j=1.. where EGj is the jth embryo genotype, a direct combination of parent haplotypes MG,FG and jth parent origin hypotheses (MHj,FHj), as mentioned before. P(MD|MG), P(FD|FG) are parent data genomic emission models and P(ED|EG) is the embryo data emission model, further discussed below.

Output: PS Parental Haplotype
Parent haplotypes probability is the marginal of the joint probability P(x|y), for all states having specific parent haplotype, i.e. P(MG|y)=∑x∈MGP(x|y). For each parent, we derive the best answer g* as the one maximizing this probability, i.e. g*=argmaxg P(g|y), with confidence P(g*).

Output: Parental Origin Hyp.
The state maximizing the marginal parent origin hypothesis probability P(H|y)=∑x∈HP(x|y).

Calculating Transmission Probabilities
In the simplified approach described above, we assumed that all parental haplotypes are inherited in the embryos without recombination. To model the meiotic recombinations between consecutive SNPs we compute the transmission probability from state z at SNP t-1 to state x at SNP t as P(x at SNP t|z at SNP t-1)=P(MG,t)*P(FG,t)*∏j=1,...,kP(MHj|MHzj,t)xP(FHj|FHzj,t), where: -P(MG,t) and P(FG,t) are parent haplotypes population priors at SNP t, derived from a large set of training data and allele frequency public databases. -P(MH|MHz,t), P(FH|FHz,t) are the hypotheses transition probabilities, derived via crossover probabilities between SNPs t-1 and t, from the HapMap database, simulating a chance of meiotic crossover between SNPs. Specifically, P(H1|H1,t)=P(H2|H2,t)=1-ct (no crossover occurred) and P(H1|H2,t)=P(H2|H1,t)=ct (crossover occurred), where ct=crossover probability between SNPs t-1 and t, derived via HapMap database.

Calculating Emission Probabilities (Emission Models)
Noise in the microarray measurements of parent or embryo are accounted for in the "emission model" of the HMM. Specifically, the emission probabilities are the per SNP product of per channel data likelihood given a true genotype G: P(Data|genotype=G)=P(Data on channel A|G)*P(Data on channel B|G). We use two different approaches to modeling channel data likelihood: a simplified discrete emission model and a more complex continuous emission model ( Figure N4). The Discrete Emission Model is defined as the channel independent matrix product: Fdin,dout(g,G)=P(genotype=g|true genotype=G, ADI, ADO)=P(#A(g)|#A(G))*P(#B(g)|#B(G)) parameterized using a drop in rate (ADI) and drop out rate (ADO), based on number of alleles A,B in true genotype G and measured genotype g, as shown in Table N6. Dropin (ADI) and dropout (ADO) rate parameters are fit on a case-by-case basis using microarray intensity data, as follows.
# alleles in G #alleles in g (measured)  First we determine a channel "noise floor" as the 95th percentile of channel array measurements for SNPS where no channel signal should be present, such as SNPS with same homozygous parents, context (AA|AA) for channel B and (BB|BB) for channel A. We have effectively fixed the drop in rate at ADI=5%. Second, we calculate the channel dropout rate (ADO) as the percent channel measurements less than "noise floor" out of all SNPS with guaranteed heterozygous genotype, such as "polar" SNPS with context (AA|BB) or (BB|AA). This process is demonstrated for one of 'case 10' embryos in Figure N3 ("Discrete Response"). Figure N4. Calculating Emission Probability parameters, continuous and discrete approach for one Day 5 embryo (Case 10). Microarray measurements are colored by parental context. Noise floor FA and FB are used in the discrete emission model.

Continuous Emission Model:
In this case, data measurements are modeled using a two-dimensional likelihood P(Data|G)=P(Channel A Measurement|G)*P(Channel B Measurement|G), where each channel likelihood is parameterized via known, continuous distribution for given G. Distribution parameters are fitted in each couple using embryo microarray measurements for parental context resulting in G. The process is demonstrated for one of 'case 10' embryos in Figure N4.

Parameter estimates used in HMM
Genomic Data Emission Model For mother and father microarray data, we adopt discrete emission model with fixed parameter values, such as genomic data ADI rate=0.1% and genomic data ADO=0.15%, determined from a large training set of genomic samples.
Embryo Data Emission Model For embryo Day 3 or Day 5 microarray data, we use either discrete model (simple case), or continuous model (more accurate), with parameters determined on per embryo basis. Mean dropout rates, for Day 3 and 5 embryos in our study, are given in Table N7. A boxplot of estimated allele dropout rates for 90 Day 5 and 20 Day 3 embryos are given in Figure N5. Of note, only euploid chromosomes are displayed. If an embryo contains 23 chromosomes and one is aneuploid, only 22 chromosomes would be included. For simplicity, the Y chromosome was not considered in this analysis.

Parental Support Results
We next examined the percent of total array sites correctly predicted both without parental support (raw genotype) or with parental support (PS embryo genotype). This analysis considered only euploid chromosomes. An average of 97% of the array sites are called with Day 5 embryo biopsies using parental support (PS embryo genotypes before cleaning with population data) vs. 68 % of array sites (n=150,000) without. Similarly, in Day 3 embryos 79% of array sites are called vs. 51% without. The accuracy of these calls is 99.5% for Day 5 embryos and 97% for Day 3 embryos, as opposed to 97.4% (Day 5) and 53.6% (Day 3) for microarray sites, an increase of 2% and 44% respectively. Mean dropout rates, coverage and correct call rates for raw and PS genotypes, along with dropout rates are given for all Day 5 (90 embryos) and Day 3 (20 embryos) in Table N7. Boxplots of rates for genotype kind and Day 5 vs Day 3 are given in Figure N6A and B. Parental Support consistently increased the accuracy and coverage of predictions. Day 3 embryos had a higher drop out rate, lower coverage and lower accuracy compared with Day 5 embryos. PS Embryo genotypes are further processed in our approach of whole genome reconstruction ( Figure  N1A) and below, to increase the number of sites predicted in both Day 3 and Day 5 embryos (see Extended Data Fig. 1).

Genome Prediction in Embryo
We used the PS Parental Haplotypes and PS Embryo Genotypes in combination with whole genome sequencing of each parent as described in the methods section to enable whole genome prediction of each embryo. See example predictions of transmitted segments in Figure N7 below. Briefly, we phased WGS-identified variants (~6 million) into "phased parental genomes" using PS Parental Haplotypes and a population reference panel. We computed which portion of the parent's chromosomes was transmitted to the embryo by comparing the PS haplotype with the PS embryo genotypes and repeated the process across all maternal and paternal chromosomes (see Extended Data Fig. 7-8). Results of the reconstruction approach can be found in Extended Data Fig. 1. Figure N7: Here is a plot of transmitted haplotypes on chromosome 3 across 8 sibling embryos derived from family 5. Transmitted haplotypes were output from parental support and form the basis of the PS Embryo Genotypes at microarray sites. Green and red lines denote parental haplotype 1 and 2 respectively for mother (MH) and father (FH) haplotypes in each embryo. (Regions of some uncertainty are colored yellow).

Supplemental Note 2. WGS with synthetic long read sequencing
Linked read sequencing data was generated for Case IDs 5, 8, 9, and 10 using the TELL-Seq library preparation method. After read alignment and variant calling using the same process described above with the addition of maintaining molecular barcode information for each read, we inferred the molecular phase using HapCut2 34 with default parameters except maxFragments=1000000 and d=40000. We annotated positions with their global allele frequency using the gnomad database 35 .

Supplemental Note 3: Within-family polygenic risk score effect size
To examine within family effects of PRS, we applied a random intercept mixed-effects model, similar to Selzam et al 2019, on a total of 9,000 sibling pairs within the UK Biobank, including two fixed effects to separate within and between family effects: Where W is the within family slope, PRSij is the PRS in individual i and family j, is the average PRS for family j, F is the slope between families, and is a random intercept term. Our analysis did not find a significant difference in breast cancer PRS effect sizes for siblings vs. unrelated individuals. Although the UK Biobank does not have enough siblings to repeat this analysis across all diseases, we anticipate similar findings for most diseases.

Supplemental Note 4: Centering approach for individuals with Ashkenazi Jewish Ancestry
For individuals with AJ ancestry, we found the above method that relies on principal component analysis of 1000 Genomes individuals did not sufficiently correct for ancestry, likely due to low representation of AJ individuals in the 1000 Genomes project. Thus, for individuals with AJ ancestry, we center and standardize using a population of AJ identified in the UK Biobank, following the approach detailed in Prive et al 2021 36 . Specifically, we project data from UK Biobank participants onto principal components calculated using the Khazar dataset from Behar et al 37 .
We then identify the geometric median of the AJ reference individuals and calculate the distance to this center for all UKB individuals. We then chose a threshold such that all AJ individuals in the reference set are included and all other populations excluded. 470 UKB individuals were assigned to this group as AJ. We use the mean and standard deviation of the scores in this group of individuals to center and standardize the raw PRS for AJ individuals in the study.